Lab2 tutorial

Yue You

2021-07-01

Lab 2 Overview

First Half

More Visualisations:

  • Heatmaps

  • HexBins vs SquareBins

  • Choropleths

Descriptive Statistics:

  • Histograms and looking at the distribution shape.

  • How to “quick plot” dataframes using ggplot2.

  • Some methods of determining bin sizes.

Second Half

Revision:

  • Any code related questions for R

Let’s get started

Visualisations

  • Binned plots

  • Good visualizations vs Bad visualizastions

Personal Checklist for Visualisations and Dashboards:

  • Your visualisation needs to tell a story.

  • It should be intepretable without being overly verbose.

  • The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).

  • The choice of visualisation needs to make sense:

  1. Line plot vs Bar chart with non-numerical categories
  2. Scatterplot vs Histogram plot to see distribution
  3. etc
  • Choice of colour scheme / alpha / size need to be easy on the eyes.

At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.

Load libraries

library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)

Load data

#try feather 
filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)
df %>% tail
## # A tibble: 6 x 19
##   VendorID tpep_pickup_dat… tpep_dropoff_da… passenger_count trip_distance
##      <int> <fct>            <fct>                      <int>         <dbl>
## 1        2 4/12/15 22:55    4/12/15 23:28                  2         12.8 
## 2        2 4/12/15 22:55    4/12/15 23:03                  1          0.75
## 3        1 4/12/15 22:55    4/12/15 23:08                  1          2.4 
## 4        1 4/12/15 22:55    4/12/15 23:01                  1          0.8 
## 5        2 4/12/15 22:55    4/12/15 23:17                  1          4.73
## 6        2 4/12/15 22:55    4/12/15 22:59                  2          0.8 
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## #   RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## #   mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## #   improvement_surcharge <dbl>, total_amount <dbl>

Heatmaps

  • Excellent tool when visualising geospatial coordinates.

Select the map.

xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges
## [1] -77.04710 -71.06483
yranges
## [1] 37.27044 42.73614
map_big <- get_stamenmap(
  rbind(-74.3,40.45,-73.75,40.95), 
  zoom = 10)
ggmap(map_big)

Plot pickup locations

ggmap(map_big) + 
  geom_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) 

Change the color scale.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_viridis_c(trans = "log10")

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")

Filled by discrete factors.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(passenger_count)),
             bins=50) 

Change the color palette. A good palette makes your story better.

library(RColorBrewer)
display.brewer.all()

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type)),
             bins=100) +
  scale_fill_brewer(palette = "Dark2")

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type),
                 alpha = 0.95),
             bins=100) +
  scale_fill_brewer(palette = "Dark2")

Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().

Plot HexBin.

ggmap(map_big) + 
  stat_binhex(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
               bins=100) +
  coord_cartesian()  +
  scale_fill_gradient2(trans="log10")

ggmap(map_big) + 
  stat_binhex(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
           alpha = 0.9,
           bins=100) +
  coord_cartesian()  +
  scale_fill_viridis_c(trans = "log10") +
  facet_grid(.~payment_type)

Shapefile Overlays

  • NOTE: This only applies on the more recent datasets that use zones over coordinates

Requirements:

geopandas

Shapefile Links:

linked phrase linked phrase

df_yellow = read.csv("/Volumes/you.y/MAST30034_R/data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>% 
  group_by(PULocationID,passenger_count) %>% 
  summarise(n=sum(passenger_count)) -> df_yellow_summary
## `summarise()` regrouping output by 'PULocationID' (override with `.groups` argument)
library(sf)
sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp")
## Reading layer `taxi_zones' from data source 
##   `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
zone = read.csv("/Volumes/you.y/MAST30034_R/data/shapefile/taxi+_zone_lookup.csv")

Basic information in shapefile.

sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/")
## Reading layer `taxi_zones' from data source 
##   `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
plot(sf)

** Choropleth with ggplot **

ggplot() + 
  geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") + 
  ggtitle("Boundary Plot") + 
  coord_sf()+
  theme_bw()

** Choropleth with tmap **

tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:

  • tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)

*tm_polygons() where we set the attribute variable to map, the break style, and a title.

tm_shape(sf)+
   tm_polygons("Shape_Area",style="quantile")

** Web mapping with leaflet **

library(leaflet) 

Incorporate the information we have.

df_yellow_summary$LocationID <- df_yellow_summary$PULocationID
full_join(sf,df_yellow_summary,by="LocationID") -> sf_full

While tmap was tolerant about our AEA projection of philly_crimes_sf, leaflet does require us to explicitly reproject the sf object.

# reproject
sf_WGS84 <- st_transform(sf_full, 4326)
leaflet(sf_WGS84) %>%
  addPolygons()
sf_WGS84 %>% filter(!is.na(n)) ->sf_plot
pal_fun <- colorQuantile("YlOrRd", NULL, n = 5)
p_popup <- paste0("<strong>Passanger number: </strong>", sf_plot$n)
leaflet(sf_plot) %>%
  addPolygons(
    stroke = FALSE, 
    fillColor = ~pal_fun(n),
    fillOpacity = 0.8, smoothFactor = 0.5,
    popup = p_popup) %>%
  addTiles() %>%
  addLegend("bottomright",  # location
            pal=pal_fun,    # palette function
            values=~n,  # value to be passed to palette function
            title = 'Passanger number') # legend title

Descriptive Statistics

filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)
summary(df)
##     VendorID        tpep_pickup_datetime   tpep_dropoff_datetime
##  Min.   :1.000   4/12/15 19:14:  508     4/12/15 21:00:  487    
##  1st Qu.:1.000   4/12/15 22:44:  506     4/12/15 21:40:  486    
##  Median :2.000   4/12/15 21:57:  503     4/12/15 20:14:  484    
##  Mean   :1.528   4/12/15 22:46:  491     4/12/15 20:58:  480    
##  3rd Qu.:2.000   4/12/15 19:39:  489     4/12/15 21:01:  480    
##  Max.   :2.000   4/12/15 19:28:  486     4/12/15 20:31:  476    
##                  (Other)      :97017     (Other)      :97107    
##  passenger_count trip_distance    pickup_longitude pickup_latitude
##  Min.   :0.000   Min.   : 0.000   Min.   :-77.05   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.: 1.030   1st Qu.:-73.99   1st Qu.:40.73  
##  Median :1.000   Median : 1.710   Median :-73.98   Median :40.75  
##  Mean   :1.704   Mean   : 2.827   Mean   :-72.88   Mean   :40.14  
##  3rd Qu.:2.000   3rd Qu.: 3.100   3rd Qu.:-73.97   3rd Qu.:40.77  
##  Max.   :6.000   Max.   :91.200   Max.   :  0.00   Max.   :42.74  
##                                                                   
##    RatecodeID     store_and_fwd_flag dropoff_longitude dropoff_latitude
##  Min.   : 1.000   N:99421            Min.   :-74.62    Min.   : 0.00   
##  1st Qu.: 1.000   Y:  579            1st Qu.:-73.99    1st Qu.:40.73   
##  Median : 1.000                      Median :-73.98    Median :40.75   
##  Mean   : 1.029                      Mean   :-72.93    Mean   :40.17   
##  3rd Qu.: 1.000                      3rd Qu.:-73.96    3rd Qu.:40.77   
##  Max.   :99.000                      Max.   :  0.00    Max.   :41.49   
##                                                                        
##   payment_type   fare_amount          extra            mta_tax       
##  Min.   :1.00   Min.   :-120.00   Min.   :-1.0000   Min.   :-0.5000  
##  1st Qu.:1.00   1st Qu.:   7.00   1st Qu.: 0.5000   1st Qu.: 0.5000  
##  Median :1.00   Median :  10.00   Median : 0.5000   Median : 0.5000  
##  Mean   :1.34   Mean   :  12.75   Mean   : 0.6049   Mean   : 0.4985  
##  3rd Qu.:2.00   3rd Qu.:  15.00   3rd Qu.: 0.5000   3rd Qu.: 0.5000  
##  Max.   :4.00   Max.   : 500.00   Max.   : 1.5000   Max.   : 2.5000  
##                                                                      
##    tip_amount      tolls_amount     improvement_surcharge  total_amount    
##  Min.   :  0.00   Min.   : 0.0000   Min.   :-0.3000       Min.   :-120.30  
##  1st Qu.:  0.00   1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:   9.30  
##  Median :  1.46   Median : 0.0000   Median : 0.3000       Median :  12.80  
##  Mean   :  1.80   Mean   : 0.2386   Mean   : 0.2997       Mean   :  16.19  
##  3rd Qu.:  2.55   3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  18.36  
##  Max.   :115.00   Max.   :24.0000   Max.   : 0.3000       Max.   : 550.30  
## 

Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example: * longitude and latitude should be interpreted as geospatial coordinates * payment_type is a discrete category of payment types * trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles. * To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.

Scatterplot plots for Fare / Distance

ggplot(df) +
  aes(x=fare_amount,y=trip_distance) +
  geom_point()+
  theme_bw()

General Inference:

  • We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
  • A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
  • Why are there negative values?

Histgram plots for Trip Distance

ggplot(df)+
  geom_histogram(aes(x=fare_amount),bins = 30) +
  theme_bw()

  • I guess we can kind of see most fares are between 0 - 100.
  • Hard to tell where the main distribution is spread around.
  • We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
  geom_histogram(aes(x=transform_fare_amount),bins = 30) +
  theme_bw()

* Take a log transformation to visually see the distribution

How about the distribution for trips under 15 miles?

ggplot(df %>% filter(trip_distance <=15))+
  geom_histogram(aes(x=transform_fare_amount),bins = 30) +
  theme_bw()

df %>% 
  mutate(trip_dist_15 = ifelse(trip_distance>15,"yes","no")) %>% 
  ggplot()+
  geom_histogram(aes(x=transform_fare_amount,
                     fill=trip_dist_15),
                 bins = 30) +
  scale_fill_brewer(palette = "Set2") +
  theme_bw()

cor(df$trip_distance,df$fare_amount,method = "pearson")
## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
         cluster_rows = FALSE,
         cluster_cols = FALSE)

  • trip_distance highly correlates with high tips, tolls and overall trip amount
  • payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra", 
 "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])
##                       passenger_count trip_distance   fare_amount        extra
## passenger_count          1.0000000000   0.001658362  0.0009505592 -0.003093358
## trip_distance            0.0016583622   1.000000000  0.8999780236 -0.223323624
## fare_amount              0.0009505592   0.899978024  1.0000000000 -0.207022734
## extra                   -0.0030933583  -0.223323624 -0.2070227338  1.000000000
## mta_tax                  0.0131316387  -0.027817098 -0.1282543385  0.130305153
## tip_amount              -0.0111149872   0.530623359  0.5573861481 -0.100025231
## tolls_amount             0.0065090920   0.572059235  0.5307124782 -0.157824284
## improvement_surcharge    0.0040463644   0.015943787  0.0489414626  0.095486378
## total_amount            -0.0007104667   0.898539215  0.9824415224 -0.186521180
##                           mta_tax  tip_amount tolls_amount
## passenger_count        0.01313164 -0.01111499  0.006509092
## trip_distance         -0.02781710  0.53062336  0.572059235
## fare_amount           -0.12825434  0.55738615  0.530712478
## extra                  0.13030515 -0.10002523 -0.157824284
## mta_tax                1.00000000 -0.06226707 -0.094005581
## tip_amount            -0.06226707  1.00000000  0.359383270
## tolls_amount          -0.09400558  0.35938327  1.000000000
## improvement_surcharge  0.59152689  0.01722265  0.004483847
## total_amount          -0.12168476  0.68568923  0.603590674
##                       improvement_surcharge  total_amount
## passenger_count                 0.004046364 -0.0007104667
## trip_distance                   0.015943787  0.8985392152
## fare_amount                     0.048941463  0.9824415224
## extra                           0.095486378 -0.1865211796
## mta_tax                         0.591526892 -0.1216847619
## tip_amount                      0.017222652  0.6856892276
## tolls_amount                    0.004483847  0.6035906745
## improvement_surcharge           1.000000000  0.0486015276
## total_amount                    0.048601528  1.0000000000
pheatmap(cor(df[,cors]))

Use Spark in R

See sparklyr tutorial.